ZINB_MODEL

Overview

The ZINB_MODEL function fits a Zero-Inflated Negative Binomial (ZINB) regression model for count data that exhibits both overdispersion and excess zeros. Zero-inflated models are appropriate when the observed number of zeros exceeds what would be expected from a standard count distribution, a situation that arises in many real-world scenarios such as insurance claims, healthcare utilization, ecological surveys, and manufacturing defects.

The ZINB model is a mixture distribution that combines two data-generating processes. The first process, modeled with a binary outcome (logit or probit), determines whether an observation is a “structural zero” (always zero) or comes from the count process. The second process uses a negative binomial distribution to generate counts, which naturally accommodates overdispersion where the variance exceeds the mean. This two-part structure allows the model to handle both the excess zeros and the overdispersion commonly observed in count data.

This implementation uses the statsmodels library, specifically the ZeroInflatedNegativeBinomialP class. The function estimates separate coefficients for the count process (predicting the magnitude of non-zero counts) and the inflation process (predicting the probability of structural zeros). The negative binomial component includes a dispersion parameter that controls the variance structure, with the P-parameterization allowing flexibility in modeling different types of overdispersion. For more information, see the statsmodels ZINB documentation and the statsmodels GitHub repository.

The mathematical formulation for the ZINB model is:

\Pr(Y = 0) = \pi + (1 - \pi) \left(\frac{\alpha}{\mu + \alpha}\right)^{\alpha}
\Pr(Y = y_i) = (1 - \pi) \frac{\Gamma(y_i + \alpha)}{\Gamma(\alpha) \cdot y_i!} \left(\frac{\alpha}{\mu + \alpha}\right)^{\alpha} \left(\frac{\mu}{\mu + \alpha}\right)^{y_i}, \quad y_i = 1, 2, 3, \ldots

where \pi is the probability of being in the zero-inflated group (structural zero), \mu is the expected count from the negative binomial process, and \alpha is the dispersion parameter.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=ZINB_MODEL(y, x, x_inflate, fit_intercept, alpha)
  • y (list[list], required): Dependent variable containing overdispersed count data (non-negative integers).
  • x (list[list], required): Independent variables for the count process. Each column represents a predictor.
  • x_inflate (list[list], optional, default: null): Independent variables for the zero-inflation process. If not specified, defaults to the same variables as x.
  • fit_intercept (bool, optional, default: true): If true, adds an intercept term to both the count and zero-inflation processes.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals (must be between 0 and 1).

Returns (list[list]): 2D list with model results and statistics, or error string.

Examples

Example 1: Basic ZINB model with single predictor

Inputs:

x y
0 0
0.5 0
1 0
1.5 1
2 2
2.5 3
3 0
3.5 1
4 4
4.5 5
5 0
5.5 2
6 3
6.5 6
7 0
7.5 1
8 2
8.5 7
9 8
9.5 4

Excel formula:

=ZINB_MODEL({0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})

Expected output:

count_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.3785 1.3471 0.281 0.7787 -2.2618 3.0188
x1 -0.4067 0.3391 -1.1993 0.2304 -1.0714 0.258
inflate_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.2241 0.515 0.4351 0.6635 -0.7854 1.2336
x1 0.1573 0.0723 2.1749 0.0296 0.0155 0.299
model_statistics
log_likelihood -36.4553
aic 82.9105
bic 87.8892
dispersion 0.0657

Example 2: ZINB with separate zero-inflation predictors

Inputs:

x x_inflate y
0 0.5 0
0.5 0.8 0
1 1.1 1
1.5 1.4 0
2 1.7 2
2.5 2 3
3 2.3 0
3.5 2.6 1
4 2.9 4
4.5 3.1999999999999997 5
5 3.5 0
5.5 3.8 2
6 4.1 3
6.5 4.4 0
7 4.7 6
7.5 5 1
8 5.3 2
8.5 5.6 7
9 5.8999999999999995 0
9.5 6.2 8

Excel formula:

=ZINB_MODEL({0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0.5;0.8;1.1;1.4;1.7;2;2.3;2.6;2.9;3.1999999999999997;3.5;3.8;4.1;4.4;4.7;5;5.3;5.6;5.8999999999999995;6.2}, {0;0;1;0;2;3;0;1;4;5;0;2;3;0;6;1;2;7;0;8})

Expected output:

count_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.6525 1.2652 0.5157 0.6061 -1.8274 3.1323
x1 -0.3937 0.3451 -1.1406 0.254 -1.0701 0.2828
inflate_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.186 0.4655 0.3995 0.6895 -0.7265 1.0984
x1 0.1724 0.0676 2.551 0.0107 0.04 0.3049
model_statistics
log_likelihood -36.4821
aic 82.9642
bic 87.9428
dispersion 0.0105

Example 3: ZINB model without intercept term

Inputs:

fit_intercept x y
false 0 0
0.5 0
1 0
1.5 1
2 2
2.5 3
3 0
3.5 1
4 4
4.5 5
5 0
5.5 2
6 3
6.5 6
7 0
7.5 1
8 2
8.5 7
9 8
9.5 4

Excel formula:

=ZINB_MODEL(FALSE, {0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})

Expected output:

count_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
x1 -0.3585 0.228 -1.5725 0.1158 -0.8053 0.0883
inflate_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
x1 0.1866 0.0265 7.0511 0 0.1347 0.2385
model_statistics
log_likelihood -36.5492
aic 79.0985
bic 82.0857
dispersion 0.0907

Example 4: ZINB model with 90% confidence intervals

Inputs:

alpha x y
0.1 0 0
0.5 0
1 0
1.5 1
2 2
2.5 3
3 0
3.5 1
4 4
4.5 5
5 0
5.5 2
6 3
6.5 6
7 0
7.5 1
8 2
8.5 7
9 8
9.5 4

Excel formula:

=ZINB_MODEL(0.1, {0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})

Expected output:

count_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.3785 1.3471 0.281 0.7787 -1.8373 2.5943
x1 -0.4067 0.3391 -1.1993 0.2304 -0.9645 0.1511
inflate_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.2241 0.515 0.4351 0.6635 -0.6231 1.0713
x1 0.1573 0.0723 2.1749 0.0296 0.0383 0.2762
model_statistics
log_likelihood -36.4553
aic 82.9105
bic 87.8892
dispersion 0.0657

Python Code

import math
import warnings
import numpy as np
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP as statsmodels_zinb

def zinb_model(y, x, x_inflate=None, fit_intercept=True, alpha=0.05):
    """
    Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros.

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.count_model.ZeroInflatedNegativeBinomialP.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable containing overdispersed count data (non-negative integers).
        x (list[list]): Independent variables for the count process. Each column represents a predictor.
        x_inflate (list[list], optional): Independent variables for the zero-inflation process. If not specified, defaults to the same variables as x. Default is None.
        fit_intercept (bool, optional): If true, adds an intercept term to both the count and zero-inflation processes. Default is True.
        alpha (float, optional): Significance level for confidence intervals (must be between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with model results and statistics, or error string.
    """
    def to2d(arr):
        return [[arr]] if not isinstance(arr, list) else arr

    def validate_numeric_2d(arr, name):
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if not arr:
            return f"Invalid input: {name} cannot be empty."
        if not all(isinstance(row, list) for row in arr):
            return f"Invalid input: {name} must be a 2D list."
        for i, row in enumerate(arr):
            if not row:
                return f"Invalid input: {name} row {i} cannot be empty."
            for j, val in enumerate(row):
                if isinstance(val, bool) or not isinstance(val, (int, float)):
                    return f"Invalid input: {name}[{i}][{j}] must be numeric."
                if math.isnan(val) or math.isinf(val):
                    return f"Invalid input: {name}[{i}][{j}] must be finite."
        return None

    # Normalize inputs
    y = to2d(y)
    x = to2d(x)
    if x_inflate is not None:
        x_inflate = to2d(x_inflate)

    # Validate inputs
    error = validate_numeric_2d(y, "y")
    if error:
        return error
    error = validate_numeric_2d(x, "x")
    if error:
        return error
    if x_inflate is not None:
        error = validate_numeric_2d(x_inflate, "x_inflate")
        if error:
            return error

    # Validate y is column vector
    if len(y[0]) != 1:
        return "Invalid input: y must be a column vector (single column)."

    # Check dimensions
    n_obs = len(y)
    if len(x) != n_obs:
        return f"Invalid input: y and x must have same number of rows ({n_obs} vs {len(x)})."
    if x_inflate is not None and len(x_inflate) != n_obs:
        return f"Invalid input: y and x_inflate must have same number of rows ({n_obs} vs {len(x_inflate)})."

    # Validate alpha
    if not isinstance(alpha, (int, float)):
        return "Invalid input: alpha must be numeric."
    if math.isnan(alpha) or math.isinf(alpha):
        return "Invalid input: alpha must be finite."
    if alpha <= 0 or alpha >= 1:
        return "Invalid input: alpha must be between 0 and 1."

    # Validate fit_intercept
    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be boolean."

    # Convert to flat lists
    y_vec = [row[0] for row in y]

    # Validate y contains non-negative integers
    for i, val in enumerate(y_vec):
        if val < 0:
            return f"Invalid input: y[{i}] must be non-negative."
        if val % 1 != 0:
            return f"Invalid input: y[{i}] must be an integer count."

    # Build design matrices
    try:
        y_arr = np.array(y_vec)
        x_arr = np.array(x)

        if fit_intercept:
            x_arr = np.column_stack([np.ones(n_obs), x_arr])

        if x_inflate is None:
            x_inflate_arr = x_arr
        else:
            x_inflate_arr = np.array(x_inflate)
            if fit_intercept:
                x_inflate_arr = np.column_stack([np.ones(n_obs), x_inflate_arr])

        # Fit model
        model = statsmodels_zinb(y_arr, x_arr, exog_infl=x_inflate_arr)

        # Try fitting with different methods, suppressing convergence warnings
        result = None
        for method in ['bfgs', 'newton', 'nm']:
            try:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    result = model.fit(disp=False, method=method, maxiter=5000, warn_convergence=False)
                # Check if we got valid standard errors
                if result.bse is not None and not np.any(np.isnan(result.bse)):
                    break
            except Exception:
                continue

        if result is None:
            return "Model fitting error: Unable to fit model with any optimization method."

        # Extract parameters
        n_count = x_arr.shape[1]
        count_params = result.params[:n_count]
        inflate_params = result.params[n_count:-1]  # Exclude dispersion
        dispersion = result.params[-1]

        # Get confidence intervals - handle case where Hessian inversion failed
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                conf_int = result.conf_int(alpha=alpha)
        except Exception:
            conf_int = np.full((len(result.params), 2), '')

        # Get standard errors, z-stats, p-values - handle NaN
        def safe_extract(arr, idx):
            if arr is None:
                return ''
            if idx >= len(arr):
                return ''
            val = arr[idx]
            if isinstance(val, (int, float)):
                if math.isnan(val) or math.isinf(val):
                    return ''
                return float(val)
            return ''

        bse = result.bse if result.bse is not None else []
        tvalues = result.tvalues if result.tvalues is not None else []
        pvalues = result.pvalues if result.pvalues is not None else []

        # Build output
        output = []

        # Section 1: Count process header
        output.append(['count_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])

        # Section 1: Count process rows
        for i in range(n_count):
            param_name = f'x{i+1}' if not fit_intercept or i > 0 else 'intercept'
            if fit_intercept and i > 0:
                param_name = f'x{i}'
            ci_lower = ''
            ci_upper = ''
            if len(conf_int) > i:
                ci_lower = float(conf_int[i, 0]) if isinstance(conf_int[i, 0], (int, float)) and not math.isnan(conf_int[i, 0]) else ''
                ci_upper = float(conf_int[i, 1]) if isinstance(conf_int[i, 1], (int, float)) and not math.isnan(conf_int[i, 1]) else ''
            output.append([
                '',
                param_name,
                float(count_params[i]),
                safe_extract(bse, i),
                safe_extract(tvalues, i),
                safe_extract(pvalues, i),
                ci_lower,
                ci_upper
            ])

        # Section 2: Inflate process header
        output.append(['inflate_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])

        # Section 2: Inflate process rows
        n_inflate = len(inflate_params)
        for i in range(n_inflate):
            param_name = f'x{i+1}' if not fit_intercept or i > 0 else 'intercept'
            if fit_intercept and i > 0:
                param_name = f'x{i}'
            idx = n_count + i
            ci_lower = ''
            ci_upper = ''
            if len(conf_int) > idx:
                ci_lower = float(conf_int[idx, 0]) if isinstance(conf_int[idx, 0], (int, float)) and not math.isnan(conf_int[idx, 0]) else ''
                ci_upper = float(conf_int[idx, 1]) if isinstance(conf_int[idx, 1], (int, float)) and not math.isnan(conf_int[idx, 1]) else ''
            output.append([
                '',
                param_name,
                float(inflate_params[i]),
                safe_extract(bse, idx),
                safe_extract(tvalues, idx),
                safe_extract(pvalues, idx),
                ci_lower,
                ci_upper
            ])

        # Model statistics
        output.append(['model_statistics', '', '', '', '', '', '', ''])
        output.append(['log_likelihood', float(result.llf) if hasattr(result, 'llf') else '', '', '', '', '', '', ''])
        output.append(['aic', float(result.aic) if hasattr(result, 'aic') else '', '', '', '', '', '', ''])
        output.append(['bic', float(result.bic) if hasattr(result, 'bic') else '', '', '', '', '', '', ''])
        output.append(['dispersion', float(dispersion), '', '', '', '', '', ''])

        return output

    except Exception as e:
        return f"Model fitting error: {str(e)}"

Online Calculator